.Net: Fix TextChunker orphan chunk token counting #14013

Open
MukundaKatta wants to merge 1 commit into microsoft:main from MukundaKatta:codex/textchunker-token-overlap

Conversation

@MukundaKatta

Motivation and Context

Fixes #13713.

TextChunker.ProcessParagraphs used word counts when deciding whether to glue a small final/orphan paragraph back into the previous paragraph. With a custom token counter, that could merge two paragraphs whose actual token count exceeds maxTokensPerParagraph, producing an oversized final chunk.
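The mismatch described above can be illustrated with a small standalone sketch. It does not use the Semantic Kernel API: `CountWords` and `CountTokens` are hypothetical stand-ins for a word-based estimate and a custom length-based counter, and the limit of 20 is chosen only for illustration.

```csharp
using System;

public static class OrphanMergeMismatch
{
    // Hypothetical custom token counter: one token per character.
    public static int CountTokens(string text) => text.Length;

    // Hypothetical word-based estimate, standing in for the old check.
    public static int CountWords(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

    public static void Main()
    {
        const int maxTokensPerParagraph = 20;    // illustrative limit
        string previous = "alpha beta gamma";    // 16 characters, 3 words
        string orphan = "delta";                 // 5 characters, 1 word
        string merged = previous + " " + orphan; // 22 characters, 4 words

        // A word-based check accepts the merge...
        Console.WriteLine(CountWords(merged) <= maxTokensPerParagraph);  // True
        // ...while the custom counter reports that it exceeds the limit.
        Console.WriteLine(CountTokens(merged) <= maxTokensPerParagraph); // False
    }
}
```

Both paragraphs fit the limit on their own under either measure; only the merged result disagrees, which is the oversized final chunk the issue reports.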

Description

This changes the orphan merge check to build the candidate merged paragraph and evaluate it with GetTokenCount(...), so the same token-counting logic controls both splitting and final orphan gluing. It also adds a regression test using a custom length-based token counter where the previous word-count check would have produced an oversized merged chunk.
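A minimal sketch of the approach described above: build the candidate merged paragraph first, then measure it with the same counter used for splitting. This is not the actual `TextChunker` code. `MergeOrphan`, the single-space separator, and the absence of any orphan-size threshold are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class OrphanMergeSketch
{
    // Hypothetical helper: glue the last paragraph into the previous one
    // only if the merged text still fits the limit under the given counter.
    public static List<string> MergeOrphan(
        List<string> paragraphs,
        int maxTokensPerParagraph,
        Func<string, int> countTokens)
    {
        var result = new List<string>(paragraphs);
        if (result.Count < 2)
        {
            return result; // nothing to merge
        }

        string previous = result[result.Count - 2];
        string orphan = result[result.Count - 1];

        // Build the candidate and count it as a whole (separator is assumed).
        string candidate = previous + " " + orphan;

        if (countTokens(candidate) <= maxTokensPerParagraph)
        {
            result.RemoveRange(result.Count - 2, 2);
            result.Add(candidate);
        }

        return result;
    }
}
```

With a length-based counter (`s => s.Length`) and a limit of 20, the paragraphs "alpha beta gamma" and "delta" stay separate because the candidate is 22 characters long. With a limit of 22 or more they are merged into one paragraph.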

Contribution Checklist

Local verification: git diff --check passes. I could not run dotnet test dotnet/src/SemanticKernel.UnitTests/SemanticKernel.UnitTests.csproj --filter FullyQualifiedName~TextChunkerTests --no-restore because this environment does not have the dotnet CLI installed.

@moonbox3 added the .NET (Issue or Pull requests regarding .NET code) and kernel (Issues or pull requests impacting the core kernel) labels on May 15, 2026
@github-actions bot changed the title from "Fix TextChunker orphan chunk token counting" to ".Net: Fix TextChunker orphan chunk token counting" on May 15, 2026

Copilot AI left a comment

Pull request overview

Fixes a bug in TextChunker.ProcessParagraphs where the “orphan paragraph” merge decision used word counts instead of the configured token-counting logic, which could produce a merged paragraph exceeding maxTokensPerParagraph when a custom tokenCounter is supplied.

Changes:

  • Update orphan-merge logic to evaluate the merged candidate using GetTokenCount(...) (consistent with the rest of the splitting flow).
  • Remove the now-unused s_spaceChar constant from TextChunker.
  • Add a regression unit test using a length-based custom token counter to ensure orphan chunks are not merged beyond the token limit.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| dotnet/src/SemanticKernel.Core/Text/TextChunker.cs | Uses token counting (via GetTokenCount) to validate orphan-paragraph merges, preventing oversized merged chunks with custom token counters. |
| dotnet/src/SemanticKernel.UnitTests/Text/TextChunkerTests.cs | Adds a regression test covering the orphan-merge scenario with a custom token counter. |

@github-actions bot left a comment

Automated Code Review

Reviewers: 4 | Confidence: 92% | Result: All clear

Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach


Automated review by MukundaKatta's agents

Labels

kernel (Issues or pull requests impacting the core kernel), .NET (Issue or Pull requests regarding .NET code)

Development

Successfully merging this pull request may close these issues.

Bug: the TextChunker.SplitPlainTextParagraphs sometimes overcount the chunk sizes

3 participants